 backward compatibility


On the Shelf Life of Fine-Tuned LLM Judges: Future Proofing, Backward Compatibility, and Question Generalization

arXiv.org Artificial Intelligence

The LLM-as-a-judge paradigm is widely used in both evaluating free-text model responses and reward modeling for model alignment and finetuning. Recently, finetuning judges with judge-specific data has emerged as an often preferred choice over directly prompting frontier models as judges, as the former achieves better performance with smaller model sizes while being more robust to common biases. However, the standard evaluation ignores several practical concerns of finetuned judges regarding their real-world deployment. In this paper, we identify and formalize three aspects that affect the shelf life of these judges: future proofing and backward compatibility (how well judges finetuned on responses by today's generator models perform on responses by future models or past models), as well as question generalization (how well judges generalize to unseen questions at test time). We study these three aspects in the math domain under a unified framework with varying train and test distributions, three SFT- and DPO-based finetuning algorithms, and three different base models. Experiments suggest that future proofing is challenging for most models, while backward compatibility is relatively easy, with DPO-trained models consistently improving performance. We further find that continual learning provides a more balanced adaptation to shifts between older and newer response distributions than training solely on stronger or weaker responses. Moreover, all models observe certain degrees of performance degradation when moving from questions seen during training to unseen ones, showing that current judges do not fully generalize to unseen questions. These findings provide insights into practical considerations for developing and deploying judge models in the face of ever-changing generators. Automatic evaluators have become a central part of the large language model (LLM) development cycle.


Backward Compatibility in Attributive Explanation and Enhanced Model Training Method

arXiv.org Artificial Intelligence

Model update is a crucial process in the operation of ML/AI systems. While updating a model generally enhances the average prediction performance, it also significantly impacts the explanations of predictions. In real-world applications, even minor changes in explanations can have detrimental consequences. To tackle this issue, this paper introduces BCX, a quantitative metric that evaluates the backward compatibility of feature attribution explanations between pre- and post-update models. BCX utilizes practical agreement metrics to calculate the average agreement between the explanations of pre- and post-update models, specifically among samples that both models predict correctly. In addition, we propose BCXR, a BCX-aware model training method that uses surrogate losses designed to theoretically lower-bound the agreement scores. Furthermore, we present a universal variant of BCXR that improves all agreement metrics by utilizing the L2 distance between the explanations of the models. To validate our approach, we conducted experiments on eight real-world datasets, demonstrating that BCXR achieves superior trade-offs between predictive performance and BCX scores.
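
As a rough illustration of the metric described in this abstract, the sketch below averages an agreement score over the samples that both models classify correctly. The agreement metric used here (overlap of the top-k features by absolute attribution) and every function name are assumptions made for illustration; the abstract does not say which agreement metrics BCX uses.

```python
def top_k_features(attribution, k):
    """Return the indices of the k features with the largest absolute attribution."""
    ranked = sorted(range(len(attribution)),
                    key=lambda i: abs(attribution[i]),
                    reverse=True)
    return set(ranked[:k])


def top_k_agreement(attr_old, attr_new, k):
    """Fraction of top-k features shared by two explanations (assumed metric)."""
    return len(top_k_features(attr_old, k) & top_k_features(attr_new, k)) / k


def bcx_score(labels, preds_old, preds_new, attrs_old, attrs_new, k=2):
    """Average agreement over samples that BOTH models predict correctly.

    Returns None when no sample is predicted correctly by both models.
    """
    scores = [
        top_k_agreement(a_old, a_new, k)
        for y, p_old, p_new, a_old, a_new
        in zip(labels, preds_old, preds_new, attrs_old, attrs_new)
        if p_old == y and p_new == y
    ]
    if not scores:
        return None
    return sum(scores) / len(scores)
```

Restricting the average to jointly correct samples follows the abstract's description; an explanation attached to a wrong prediction is presumably not one that an update is expected to preserve.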


Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models

arXiv.org Artificial Intelligence

Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, once it is pretrained, it avoids using the old model during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model's knowledge by avoiding any modifications. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades when a new VLP model emerges.


GitHub - GPflow/GPflow: Gaussian processes in TensorFlow

#artificialintelligence

GPflow is a package for building Gaussian process models in Python. It implements modern Gaussian process inference for composable kernels and likelihoods. GPflow builds on TensorFlow 2.4 and TensorFlow Probability for running computations, which allows fast execution on GPUs. The online documentation (for the latest release and for the develop branch) contains more details. It was originally created by James Hensman and Alexander G. de G. Matthews.


How I Refactored a Monolithic Code Base Into an Add-In Architecture

#artificialintelligence

Before my first professional job, I would listen to developers talk about what it was like to work on someone else's messy code that consisted of anti-patterns. They would tell horror stories. Then, I took my second assignment as a fresh Dotnet developer, and that horror was exactly what I had been scared of. My new job was to integrate engineering rule sets into an engineering application. The application was already developed and running with a library with three rule sets.


Why I switched from console gaming to PC gaming

PCWorld

I've been a gamer pretty much all of my life. When I was a kid, my dad taught me how to navigate the puzzles and defeat the bosses in Legend of Zelda: Link to the Past on the original Nintendo. Those are some of my earliest childhood memories, which is why console gaming will always have a special place in my heart. That said, now that I'm in my mid-thirties and value comfort and convenience above all else, I've mostly switched to PC gaming these days. I'm not going to lie, my argument for switching to PC gaming is mostly rooted in my persnickety personality.


Improving Prediction Backward-Compatibility in NLP Model Upgrade with Gated Fusion

arXiv.org Artificial Intelligence

When upgrading neural models to a newer version, new errors that were not encountered in the legacy version can be introduced, known as regression errors. This inconsistent behavior during model upgrade often outweighs the benefits of accuracy gain and hinders the adoption of new models. To mitigate regression errors from model upgrade, distillation and ensemble have proven to be viable solutions without significant compromise in performance. Despite the progress, these approaches attained an incremental reduction in regression which is still far from achieving backward-compatible model upgrade. In this work, we propose a novel method, Gated Fusion, that promotes backward compatibility via learning to mix predictions between old and new models. Empirical results on two distinct model upgrade scenarios show that our method reduces the number of regression errors by 62% on average, outperforming the strongest baseline by an average of 25%.
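
The idea of mixing the predictions of the old and new models can be sketched as a convex combination controlled by a gate. This is only a minimal sketch under stated assumptions: the scalar `gate_logit` is passed in by hand rather than learned, and the mix is applied to class probabilities. The names `softmax` and `gated_fusion` are hypothetical, and the abstract does not specify how the paper's gate is parameterized or trained.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(old_logits, new_logits, gate_logit):
    """Mix old and new class probabilities with a sigmoid gate.

    gate close to 1 trusts the new model; gate close to 0 falls back
    to the old model. In a trained system the gate would be produced
    by a learned module; here it is a plain input (assumption).
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    p_old = softmax(old_logits)
    p_new = softmax(new_logits)
    return [gate * n + (1.0 - gate) * o for o, n in zip(p_old, p_new)]
```

Because the result is a convex combination of two probability vectors, it is itself a valid probability distribution for any gate value.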


Backward Compatibility During Data Updates by Weight Interpolation

arXiv.org Artificial Intelligence

Backward compatibility of model predictions is a desired property when updating a machine-learning-driven application. It allows the underlying model to be improved seamlessly without introducing regression bugs. In classification tasks, these bugs occur in the form of negative flips: an instance that was correctly classified by the old model is classified incorrectly by the updated model. This has a direct negative impact on the user experience of such systems, e.g., a frequently used voice assistant query is suddenly misclassified. A common reason to update the model is that new training data has become available and needs to be incorporated. Simply retraining the model with the updated data introduces unwanted negative flips. We study the problem of regression during data updates and propose Backward Compatible Weight Interpolation (BCWI). This method interpolates between the weights of the old and new models, and we show in extensive experiments that it reduces negative flips without sacrificing the improved accuracy of the new model. BCWI is straightforward to implement and does not increase inference cost. We also explore the use of importance weighting during interpolation and averaging the weights of multiple new models in order to further reduce negative flips.
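
The two ideas in this abstract, counting negative flips and interpolating weights, can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: weights are represented as a dictionary of flat lists, the function names are hypothetical, and the convention that `alpha = 0` gives the old model and `alpha = 1` the new one is an assumption.

```python
def negative_flip_rate(labels, preds_old, preds_new):
    """Fraction of samples the old model got right and the new model gets wrong."""
    flips = sum(
        1 for y, p_old, p_new in zip(labels, preds_old, preds_new)
        if p_old == y and p_new != y
    )
    return flips / len(labels)


def interpolate_weights(weights_old, weights_new, alpha):
    """Linear interpolation between two weight sets.

    alpha = 0.0 returns the old weights, alpha = 1.0 the new weights
    (assumed convention). Both inputs map parameter names to flat
    lists of floats with identical names and lengths.
    """
    if weights_old.keys() != weights_new.keys():
        raise ValueError("old and new models must share parameter names")
    return {
        name: [(1.0 - alpha) * o + alpha * n
               for o, n in zip(weights_old[name], weights_new[name])]
        for name in weights_old
    }
```

Since interpolation yields a single set of weights with the same shape as either model, serving it costs the same as serving one model, which is consistent with the abstract's note that inference cost does not increase.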


On the Model of Computation: Counterpoint

Communications of the ACM

Andy Grove (Intel's business leader until 2004) coined the term "software spiral" (SWS) for the exceptionally resilient business model behind general-purpose CPUs. Application software is the defining component of SWS: code written once can still benefit from the performance scaling of later CPU generations. SWS comprises several abstraction levels. The random access machine, or model (RAM), is the one most relevant to the current Counterpoint Viewpoint (CPV): each serial step of an algorithm features a basic operation taking unit time (the "uniform cost" criterion). The RAM has long been the gold standard for algorithms and data structures.


An Empirical Analysis of Backward Compatibility in Machine Learning Systems

arXiv.org Machine Learning

In many applications of machine learning (ML), updates are performed with the goal of enhancing model performance. However, current practices for updating models rely solely on isolated, aggregate performance analyses, overlooking important dependencies, expectations, and needs in real-world deployments. We consider how updates, intended to improve ML models, can introduce new errors that can significantly affect downstream systems and users. For example, updates in models used in cloud-based classification services, such as image recognition, can cause unexpected erroneous behavior in systems that make calls to the services. Prior work has shown the importance of "backward compatibility" for maintaining human trust. We study challenges with backward compatibility across different ML architectures and datasets, focusing on common settings including data shifts with structured noise and ML employed in inferential pipelines. Our results show that (i) compatibility issues arise even without data shift due to optimization stochasticity, (ii) training on large-scale noisy datasets often results in significant decreases in backward compatibility even when model accuracy increases, and (iii) distributions of incompatible points align with noise bias, motivating the need for compatibility aware de-noising and robustness methods.